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Abstract 

Background: Many studies have revealed correlations between breast tumour phenotypes, variations in gene 
expression, and patient survival outcomes. The molecular heterogeneity between breast tumours revealed by these 
studies has allowed prediction of prognosis and has underpinned stratified therapy, where groups of patients with 
particular tumour types receive specific treatments. The molecular tests used to predict prognosis and stratify 
treatment usually utilise fixed sets of genomic biomarkers, with the same biomarker sets being used to test all 
patients. In this paper we suggest that instead of fixed sets of genomic biomarkers, it may be more effective to use 
a stratified biomarker approach, where optimal biomarker sets are automatically chosen for particular patient 
groups, analogous to the choice of optimal treatments for groups of similar patients in stratified therapy. We 
illustrate the effectiveness of a biclustering approach to select optimal gene sets for determining the prognosis of 
specific strata of patients, based on potentially overlapping, non-discrete molecular characteristics of tumours. 

Results: Biclustering identified tightly co-expressed gene sets in the tumours of restricted subgroups of breast 
cancer patients. The co-expressed genes in these biclusters were significantly enriched for particular biological 
annotations and gene regulatory modules associated with breast cancer biology. Tumours identified within the 
same bicluster were more likely to present with similar clinical features. Bicluster membership combined with 
clinical information could predict patient prognosis in conditional inference tree and ridge regression class 
prediction models. 

Conclusions: The increasing clinical use of genomic profiling demands identification of more effective methods to 
segregate patients into prognostic and treatment groups. We have shown that biclustering can be used to select 
optimal gene sets for determining the prognosis of specific strata of patients. 
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Background 

Breast tumours are currently primarily classified based on 
histological appearance or grade, as well as using certain 
molecular biomarkers. Numerous studies have correlated 
variation in gene expression between individuals to pheno- 
typic diversity in breast tumours [1-8]. Information about 
expression of biologically important genes may improve 
clinical decision-making by improving the prediction of 
clinical outcomes, in comparison to traditional histological 
grade [7,9-14]. Perou et al. [2] pioneered breast tumour 
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classification based on hierarchical clustering of genes, 
identifying biologically meaningful molecular signatures 
associated with features such as cellular proliferation and 
signalling pathway activity. Related work [3,15,16] has 
identified five distinct molecular-based subtypes (the 
PAM50 subtypes): Basal, Her2, Luminal A, Luminal B and 
Normal-like breast tumours. A recent study has clustered 
a combined dataset including germline and somatic copy 
number variation, SNP information and gene expression 
data for some 2000 breast tumours, finding 10 molecularly 
defined subgroups with apparently distinct biology and 
disease-specific survival characteristics [17]. 

Many of these studies use clustering techniques [18-23] 
to group tumours with similar gene expression patterns, 
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and thus to identify clinically relevant molecular biomar- 
kers for tumour classification. However, genes that are co- 
expressed across one subset of tumours may be expressed 
almost independently in another subset [24]. Therefore, 
subtle but important biological features characterised by 
specific gene sets may be missed when viewed across data 
sets made up of a wide range of tumours. 

To overcome this limitation, biclustering algorithms 
group genes and tumour samples simultaneously, to iden- 
tify subsets of genes with similar expression patterns over 
subsets of tumour samples. Biclustering was first applied 
in gene expression profiling studies by Cheng and Church 
[22]. Since then various biclustering approaches have been 
proposed [20,21,25-32]. We reasoned that biclustering 
may be able to identify clinically-significant gene expres- 
sion modules that stratify breast cancers according to 
inter-tumour heterogeneity. To our knowledge, the use of 
biclustering has not previously been investigated for this 
purpose. We used a biclustering technique, cMonkey [32], 
to group breast tumours from 437 patients into subgroups 
which were conditionally dependent on expression profiles 
of specific gene subsets. We subsequently determined the 
association of biclusters with differential survival out- 
comes, and examined the potential for biclusters to im- 
prove prediction of early relapse, in order to test the 
suitability of biclusters as stratified biomarkers that are 
personalised to specific subsets of breast cancer patients. 



Results 

Biclustering breast cancer gene expression data identifies 
tumour subgroups with similar clinical features 

We analysed a previously published dataset which com- 
bines Affymetrix HG-U133A and HG-U133PLUS2 micro- 
array data and matched clinical data from 6 cohorts of 
patients from independent studies. We selected 437 pri- 
mary breast tumours from patients who had not received 
adjuvant or neo-adjuvant chemotherapy, endocrine-based 
therapy or radiotherapy (see Methods). We restricted our 
analysis to non-adjuvant treated tumours so that direct 
associations between molecular information and long- 
term disease outcomes might be observed without the 
confounding affects of different treatments. After proces- 
sing, the dataset consisted of 7756 probe sets (genes) 
across the 437 tumours. The 7 clinical annotations for 
each tumour/patient were: lymph node (LN) status; hor- 
mone receptor status (estrogen receptor (ER) status and 
progesterone receptor (PgR) status); tumour grade; 
tumour subtype; tumour size; patient age. The clinical in- 
formation associated with these 437 tumours is sum- 
marised in Figure 1. More than 90% of these were lymph 
node negative (LN-) tumours, which is consistent with the 
fact that a clinical decision had been made not to treat 
these patients with adjuvant therapy. 

Using cMonkey to bicluster the dataset we obtained 
44 biclusters with an average number of 20 tumours 
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Figure 1 Clinical annotations for patients without adjuvant therapy (437 patients). The proportion of patients/tumours in each clinical 
category is denoted above the bars. Clinical variables include cohort, lymph node status (LN status), estrogen receptor status (ER status), 
Progesterone status (PgR status), tumour histological grade (G1, G2, G3), tumour subtype (Basal, Her2, LumA, LumB, Normal, None), tumour size 
(mm), patient age (years). 



Wang et al. BMC Genomics 2013, 14:102 
http://www.biomedcentral.com/1471 -21 64/1 4/1 02 



Page 3 of 15 




Wang et al. BMC Genomics 2013, 14:102 
http://www.biomedcentral.com/1471 -21 64/1 4/1 02 



Page 4 of 1 5 



(See figure on previous page.) 

Figure 2 Clinical information across all biclusters, sorted in ascending order of early relapse proportion. Panel (A) Patients whose disease free 
survival (DFS) time was less than 2 years are classified into early relapse group (dark blue); Patients whose DFS time was more than 8 years are classified 
into late relapse group (dark red); and patients whose DFS time was between 2 and 8 years were grouped into the intermediate group (green). The 
proportion of samples in each group was computed and biclusters are ranked in ascending order of early relapse proportion. For each bicluster, the 
proportion of samples in each category of (B) Grade, (C) estrogen receptor (ER) status and (D) tumour subtype are presented. 



contributing to each bicluster. To determine whether 
the tumour subgroups and corresponding gene modules 
identified by biclustering were biologically and clinically 
meaningful, we assessed whether tumours allocated to 
the same bicluster had similar clinico-pathological anno- 
tations and clinical outcomes, and whether the genes 
contributing to these biclusters were associated with the 
activity of distinct biological pathways. To make sure 
that the tumour subgroups identified in the biclusters do 
not simply recapitulate clinical cohort we verified that 
biclusters contain data derived from a range of cohorts. 
Then we computed the proportion of tumours in each 
subgroup in each of the 7 clinical categories. Figure 2 
illustrates the proportion of tumour samples in each cat- 
egory of (A) histological grade, (B) ER status, and (C) 
molecular tumour subtype in each of the biclusters, 
ranked in order of the proportion of patients who suf- 
fered early relapse (Disease Free Survival time < 2 years). 
These results show that biclusters do not simply recap- 
itulate any of the already available clinical annotations. 

Analysis of the clinical annotations of tumours in each 
bicluster however reveals similar clinico-pathological 
features. Clinical annotations for each bicluster are sum- 
marised in the supplementary material (Additional file 1: 
Figure SI). For example, a summary of clinical annota- 
tions for biclusters 21 and 7 are shown in Figure 3. Both 
consist of 31 tumour samples collected from all six 
cohorts. In bicluster 21 most tumours (> 90%) have either 
Basal-like (42%) or Her2 (52%) intrinsic subtype; most 
(-87%) are estrogen-receptor negative (ER-), and are grade 
2 (32%) or 3 (58%). Therefore, the clinical information for 
bicluster 21 suggests that tumours in this group are aggres- 
sive and have pathological features associated with poor 
prognosis. By comparison, tumours in bicluster 7 are pre- 
dominantly Luminal-like tumours, i.e. Luminal A (64%) 
and Luminal B (35%), and are estrogen-receptor positive 
(-93%), and histological grade 1 (32%) or 2 (55%). Thus 
the clinical information of tumours in bicluster 7 suggests 
that patients in this group have clinical features that are 
generally associated with less aggressive tumours, asso- 
ciated with good prognosis. Similar patterns can be 
observed in the remaining biclusters (see Additional file 1). 

Overall, these data suggest that biclustering identifies 
subgroups of tumours selected from multiple cohorts, 
which do not simply recapitulate existing clinical anno- 
tations, but which are similar in clinico-pathological 
characteristics. 



Biclustered subgroups are associated with differential 
survival outcomes 

To test the hypothesis that subgroups of tumours identi- 
fied in biclustering are associated with differential pa- 
tient outcomes, we examined the disease free survival 
(DFS) distribution of patients in each bicluster. Figure 4 
shows Kaplan-Meier survival curves for all 44 biclusters 
(grey) superimposed on the survival curves for histo- 
logical grade (panel A) and tumour subtype (panel B). 
Bicluster-associated survival curves show a greater range 
than the grade-associated and subtype-associated sur- 
vival curves. To test whether this range of survival distri- 
butions might arise by chance, we randomly assigned 
(sampling with replacement) the 437 tumours into 44 ar- 
bitrary groups with the same distribution of sizes as found 
in the biclusters. The Kaplan-Meier survival curves of 
these arbitrary groups (see Additional file 1: Figure S2) 
visually suggest that the bicluster-specific survival curves 
are more widely separated than expected at random. In 
order to assess this statistically, we computed the logrank- 
p values for all possible pairings of biclusters and for all 
possible pairings of the randomised groups. A two-sample 
Kolmogorov-Smirnov (KS) test rejected at the 5% signifi- 
cance level the null hypothesis that these distributions 
were the same (p-value = 1.7193 x 10" 16 ). These results in- 
dicate that membership of the biclusters is associated with 
differential survival outcomes. 

Tumours grouped in biclusters are molecularly distinct 

To reveal molecular activities of oncogenic pathways asso- 
ciated with the biclusters, we used a univariate Cox PH 
model to determine the molecular features most strongly 
associated with clinical outcome for each bicluster, identi- 
fying probe sets whose expression was significantly asso- 
ciated with DFS (Cox PH p-value < 0.01). We then used 
the GeneSetDB web tool [33] to identify the significantly 
enriched (GeneSetDB p-value < 0.01) functional annota- 
tions of these genes (see Additional file 2). For example, 
RNAs associated with several of the biclusters were 
enriched for cell cycle annotations (e.g. cell division [GO: 
0051301] in bicluster 1, cell cycle checkpoint [GO: 
0000075] in bicluster 10, mitotic cell cycle [GO: 0000278] 
in bicluster 17 and mitotic cell cycle checkpoint [GO: 
0007093] in bicluster 26), proliferation (e.g. decreased cell 
proliferation [MP:0000352] in biclusters 18, 20 and 26, cell 
proliferation [GO: 0008283] and decreased B cell pro- 
liferation [MP: 0005093] in bicluster 23, increased cell 
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Figure 3 Clinical information of tumour samples in biclusters 21 and 7. Clinical information of tumour samples in (A) bicluster 21 
(31 tumours) and (B) bicluster 7 (31 tumours), summarizing the proportion of tumours belonging to each category of clinical factor, including 
cohort, lymph node status (LN+, LN-), estrogen receptor status (ER+, ER-), Progesterone status (PgR+, PgR-), tumour grade (G1, G2, G3), tumour 
subtype (Basal, Her2, LumA, LumB, Normal, None), tumour size (mm), patient age (years). Bicluster 21 is composed of 31 tumour samples from six 
cohorts, y-axis indicates the proportion of samples in the group belongs to the same category. The number above each bar indicates the 
proportion of samples in the population of 437 non-adjuvant treated patients belongs to the same category. 
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Figure 4 Kaplan-Meier estimate of disease-free survival (DFS) distribution. Kaplan-Meier plots of DFS for all biclustered patient groups 
(44 biclusters) in grey, superimposed on survival curve for tumours classified by histological grades (A) and tumour subtypes (B). 



proliferation [MP:0000351] in bicluster 33). RNAs asso- 
ciated with several of the biclusters were also enriched for 
immune response annotations (e.g. immunological syn- 
apse [GO: 0001772] and wound healing [GO: 0042060] in 
bicluster 7). Genes strongly associated with specific biclus- 
ters were also found to be significantly enriched for 
cancer-associated transcription factors, such as MYC 
(e.g. biclusters 35 and 38), STAT1 (e.g. biclusters 7 and 
28), E2F (e.g. V$E2F_Q3_01 and V$E2F1_Q6: E2F-1 in 
biclusters 25 and 35), SP1 (e.g. biclusters 30 and 33) and 
NFE2L2 (bicluster 23), and enriched in regulatory path- 
ways, which play an important role in breast cancer pro- 
gress, such as Regulatory DNA replication (REACTOME 



Pathway ID: 69304) and p53-Dependent Gl/S DNA dam- 
age checkpoint (REACTOME pathway ID: 69620 and 
69613)) in bicluster 23, Gl/S-Specific Transcription 
(REACTOME pathway ID: 453279 and 69205) in biclusters 
18 and 20, Interferon signalling (REACTOME pathway ID: 
913531) in biclusters 7 and 10, and SREBP signalling 
(Wikipathway PW:0000753) in bicluster 23. Additional file 
3: Table SI summarises biological meaningful annotations 
for biclusters associated with low risk of disease recurrence 
and for biclusters associated with high risk of relapse. 

To compare the gene sets identified for each bicluster we 
searched for biologically significant genes which were dif- 
ferentially expressed between pairs of biclusters, showing a 
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significant difference (logrank p-value < 0.01; Additional 
file 1: Figure S3) in survival distribution (see Methods; 
Additional file 1: Figure S4). As an example, Additional file 
3: Table S2 summarises significant differentially expressed 
genes between bicluster 7 and bicluster 21. We analysed 
these gene classifiers and found them to be highly 
enriched in GO categories including response to estrogen 
stimulus, mammary gland development, defence re- 
sponse and wound healing. (A selected list of biological 
terms significantly enriched with differentially expressed 
genes between biclusters 7 and 21 is summarised in 
Additional file 3: Table S3). The gene classifiers were also 
highly associated with transcription factor STAT group 
(TransFac, p-value < 0.001), and were significantly enriched 
in regulatory pathways such as Downregulated of MTA-3 
in ER-negative Breast Tumors (Biocarta, p-value = 1.80 x 
10" 4 ) and Role of ERBB2 in Signal Transduction and On- 
cology (Biocarta, p-value = 2.50 x 10" 3 ). Breast abscess 
(Disease/Phenotype, p-value = 5.90 x 10" 6 ) and abnormal 
mammary gland alveolus morphology (Disease/Phenotype, 
p-value = 7.90 x 10" 5 ) were also found in genes characteriz- 
ing the difference in survival outcomes between these two 
biclusters. 

Overall, these results imply that the biclusters are mo- 
lecularly defined by functionally coherent, but distinct, 
gene expression modules. 

Bicluster gene classifiers are different to commercial 
Oncotype DX and MammaPrint profiles 

To determine whether currently known biomarkers are 
associated with the biclusters, in Figure 5 (A) we plot gene 
expression for tumours in each bicluster of genes from the 
commercialised Oncotype DX and MammaPrint diagnos- 
tic tests, the genetic grade signature proposed by Ivshina 
et al. [34], and other key oncogenic markers. The figure 
indicates three broad groupings of biclusters, which have 
significant differences in disease-free survival, shown in 
Figure 5 (B). The first 10 biclusters (cyan bar) are asso- 
ciated with low risk of disease recurrence (longest DFS). 
The last 10 biclusters (magenta bar) are associated with 
high risk of disease recurrence (shortest DFS). The heat- 
map suggests that these known biomarkers are not able to 
distinguish patient subgroups identified by biclustering. 

We then looked at the genes and related pathways iden- 
tified as being differentially expressed between biclusters 
associated with low and high survival prognosis. Interest- 
ingly, many of these gene classifiers are not found in cur- 
rently available prognostic gene classifiers for breast 
cancer. For each pair of biclusters showing differentially 
survival outcomes, we used prediction analysis of microar- 
rays (PAM) [35] to identify those of probe sets which were 
able to accurately characterise survival outcome differ- 
ences between the biclusters (Additional file 4). Among 
the gene classifiers for bicluster 7 and bicluster 21 



(see Additional file 3: Table S4), GRB7 and ERBB2 (highly 
expressed in bicluster 21) and ESR1 and BCL2 (highly 
expressed in bicluster 7) are members of the Oncotype 
DX gene list. Overall, however, we found only a small pro- 
portion of gene classifiers that coincide with the commer- 
cial gene lists. 

Biclusters show strong prognostic association with breast 
cancer patient survival in comparison to conventional 
clinical markers 

We used a univariate Cox PH model to examine the as- 
sociation between conventional prognostic factors (in- 
cluding hormone receptor status, lymph node status, 
tumour size and age) and disease free survival (DFS) 
time. Clinical variables were entered as binary data while 
tumour size and age were discretized into categorical 
variables (see Methods). The strength of association be- 
tween clinical variables and the survival outcome, as 
quantified by Cox PH p-value, was compared to prog- 
nostic significance of biclustered tumour groups, using 
the same univariate Cox PH model. Bicluster membership 
was found to have a strong association with DFS (univari- 
ate Cox PH p = 8.67 x 10" 5 ), after LN status (p = 1.71 x 10" 5 ) 
and tumour size (p = 7.44 x 10" 5 ), shown in Additional 
file 3: Table S5 (a). We also found that biclusters give 
better prognostication of relapse compared to the 
conventional molecular tumour subtype (univariate 
Cox PH p = 0.000427). 

Next, we used a multivariate Cox PH model to estimate 
the association of biclusters with DFS in combination 
with the conventional clinical factors. We found that 
biclusters were significantly statistically associated with 
DFS time (Additional file 3: Table S5 (a), multivariate 
Cox PH p = 0.00103; compared to randomly allocated 
groups, Additional file 3: Table S5 (f), multivariate Cox 
PH p = 0.6163). 

Finally, we also found that biclusters show stronger 
prognostic association with disease recurrence within 
pathology-specific tumour subsets (Additional file 3: 
Table S5 (b)-(e): LN- patient subgroup, univariate Cox PH 
p = 0.000183; ER+ patient subgroup, p = 0.00406; Grade 2 
tumour subgroup, p = 0.000171; Grade 3 tumour sub- 
group, p = 0.0153). In each of these pathology-specific 
tumour subsets, multivariate Cox PH analysis also demon- 
strated that the tumour biclusters were significantly in 
associated with DFS compared to molecular tumour sub- 
type when combined with other conventional clinical vari- 
ables (Additional file 3: Table S5 (b)-(e)). 

These univariate and multivariate Cox PH analyses 
suggest that the classification of tumours using bicluster- 
ing can, alone or in combination with other clinical fac- 
tors, be a significant prognostic variable, especially in 
pathology-specific patient subsets, in comparison to con- 
ventional clinical-pathological factors. 
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Figure 5 Gene expression for tumours in each bicluster of known biomarkers. Panel (A): Heatmap of oncogenic pathway activity across 
biclusters, sorted by the proportion of tumour samples in early relapse (DFS < 2 years), as for Figure 2 (A). The colour bar labelling is the same as 
used in Figure 2. Each row represents a gene signature and each column represents a tumour. The red colour indicates that the expression of 
gene is higher than the average expression across all tumours; the green colour indicates that the expression of gene is lower than the average 
expression across all tumours. Panel (B): Kaplan Meier (KM) plots for survival estimation of patients in first 10 biclusters (KM curve in cyan), last 10 
biclusters (KM curve in magenta) and the remainder of the biclusters (KM curve in blue). 



Biclusters increase prediction of patient survival 

The ultimate goal of tumour classification is to improve 
prediction of an individuals risk of relapse, so as to aid the 
clinician in decision making on treatment strategy (e.g. 
whether a patient should or should not be given adjuvant 
therapy). To test whether biclusters improve prediction of 
patient survival, we used a conditional inference tree 
model to predict disease recurrence. We chose to exam- 
ine disease recurrence (relapse) within 2 years of primary 
surgery. Therefore, we defined a dichotomous clinical 
outcomes decision tree, in which early relapse equates to 
disease free survival time < 2 years, and late relapse to 
DFS time > 8 years (defined in this manner the data set 
contains 92 early relapse and 211 late relapse patients). 
Bicluster membership (representing molecular informa- 
tion) and clinical information including hormone receptor 
status (ER status), lymph node status, histological grade, 
tumour size and patient age were included as binary vari- 
ables for each tumour. 



Initially we generated a decision tree based on conven- 
tional clinical covariates only. Figure 6 (A) demonstrates 
that lymph node (LN) status was found to be the most 
important clinical variable for estimation of patient 
clinical outcomes (p < 0.001) in comparison to the 
remaining clinical variables, followed by Tumour Grade 
(p < 0.001) and ER status (p = 0.088). Since -90% of the 
tumours in our dataset are LN negative tumours, we con- 
structed a decision tree for the LN- branch, shown in Fig- 
ure 6 (B). In LN- tumours, histological grade was the most 
significant variable (p < 0.001), followed by tumour size (p 
= 0.027) and ER status (p = 0.183). Next we constructed a 
decision tree for LN- tumours using both biclusters and 
clinical variables. Figure 7 shows that the bicluster 36 
(p = 0.001) and bicluster 21 (p < 0.001) were significant mo- 
lecular covariates in LN- tumours compared to the other 
clinical variables. We determined similar decision trees for 
continuous DFS time without dichotomisation (data not 
shown). 
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To assess the performance of the conditional inference 
tree predictors, we performed a cross-validation by gen- 
erating 50 random splits of the data into training (80%) 
and test (20%) sets, ensuring that both training and test 
sets contained similar proportion of early and late re- 
lapse patients as the full data set. Biclusters derived by 
applying cMonkey to the training data only were com- 
bined with clinical variables and used to generate the de- 
cision tree. Performance was assessed by comparing 
relapse time predicted by the decision tree to the actual 
survival outcomes of patients in the test set. Tumours in 
the test set were assigned to biclusters by calculating 
correlation with the first principal component of the 
gene expression profile of each bicluster (see Methods). 
To assess the contribution of biclusters to prediction of 
relapse times, we compared these results with relapse 
times predicted using decision trees constructed from 
clinical variables only. The average rate of correctly pre- 
dicted early relapse increased from 41% to 58% by in- 
corporating biclusters into the predictors. 

To assess the performance of the predictors for a 
clinically-determined subset of tumours we estimated of 
early relapse for patients with moderately differentiated 
grade 2 and poorly differentiated grade 3 tumours. We 
found 17% and 20% improvements in prediction of early 
relapse in these patients by including bicluster informa- 
tion (increase from 39% to 56% for patients with grade 2 
tumours and from 46% to 66% for patients with grade 3 
tumours). 

We note that the conditional inference tree prediction 
model performed consistently with different choice of 
survival time boundaries (data not shown): no significant 
deviation in the performance was observed when differ- 
ent cut-off times for relapse were chosen. We also con- 
sidered an alternative prediction model. Bovelstad et al. 
[36,37] performed a comparative study of six Cox 
regression-based survival prediction methods, including 
univariate selection, supervised principal component 
analysis, partial least squares regression, ridge regression 
and lasso, finding that ridge regression presented the 
best overall performance. We repeated the above ana- 
lysis using a Cox regression-based ridge regression 
model. A higher median area under the ROC curve 
(AUC) [38] was observed, in particular for DFS time less 
than 2 years, across 50 random data splits when the 
model incorporated bicluster information (Figure 8), in- 
dicating better prediction of early relapse compared to 
the model with clinical covariates only, qualitatively con- 
sistent with the results from the conditional inference 
tree. 

Taken together, these results suggest that combining 
conventional clinical information with bicluster mem- 
bership is able to provide a better prediction of early re- 
lapse in breast cancer patients. 



Discussion 

In this study we have shown biclustering to be a useful ap- 
proach to identifying subgroups of tumours, based on the 
use of stratified biomarkers that are personalised to spe- 
cific subsets of patients. Biclustering determines gene 
modules and related clinical features which are important 
in determining phenotypic and clinical outcomes in those 
patients, but not in others. 

In particular, we have applied biclustering to a large 
breast cancer expression data set that includes careful 
clinical annotations, and have used this method to identify 
clusters of breast tumours conditional on common ex- 
pression profiles across a set of genes. We also demon- 
strated that biclusters do not simply recapitulate any 
obvious single, known clinical covariate (Figure 3 and 
Additional file 1: Figure SI), but instead represent a group 
of tumours co-expressing a set of genes that are associated 
with similar clinical presentation and give rise to recur- 
rence risk. We found that biclusters have strong prognos- 
tic association with patient survival outcomes, especially 
in pathology-specific tumour subsets, compared to con- 
ventional molecular tumour subtype. The predictive value 
of biclustering in the prediction of early relapse of disease 
was observed by comparing prediction of conditional in- 
ference tree models constructed from clinical covariates 
with those constructed from a combination of clinical cov- 
ariates and biclusters. On average we found that including 
biclusters provided close to a 20% improvement over pre- 
diction of early relapse using clinical covariates only. 

We identified 44 biclusters in the dataset studied here. 
Comparisons showed that the tumours comprising some 
of these biclusters were enriched for similar clinical fea- 
tures, showing no significant difference in their Kaplan- 
Meier survival curves, despite subtle differences in the 
gene signatures contributing to the biclusters, i.e. different 
molecular signatures contributing to similar phenotype or 
disease state. Such differences may provide important in- 
formation for prognosis and may contribute to our under- 
standing of genetic regulatory mechanisms underlying 
patient outcomes. 

Our analysis suggests that unsupervised biclustering is 
able to discover tumour subgroups determined from 
biologically meaningful gene sets. However, overall we 
found only a small proportion of these gene classifiers 
coincide with the 21-mRNA Oncotype DX RT-PCR 
assay [39] and the 70-mRNA MammaPrint [40] gene 
lists, which have been developed for determining the risk 
of breast tumour recurrence in patients with stage I or 
II node-negative breast cancer. By examining the expres- 
sion pattern of Oncotype DX and Mammaprint profiles, 
as well as genetic grade markers proposed by Ivshina 
et al. [34], across all biclusters, we found that the expres- 
sion patterns of these gene markers were associated with 
two broad groups of patients with differential clinical 
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Figure 6 The schematic diagram of the conditional inference tree of model using only conventional clinical variables. Panel (A) indicates that 
Lymph Node status is the most significant factor followed by Grade and ER status. The survival outcomes are displayed in bar graphs in which Groupl 
represents the proportion of patients with DFS time < 2 yrs (early relapse) and Group2 represents the proportion of patients with DFS time > 8 yrs 
(late relapse). The tree is trimmed at the significance level of 0.5. Panel (B) illustrates a decision tree constructed for LN- tumours only. 
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Figure 7 The schematic diagram of the conditional inference tree constructed for LN- tumours incorporating bicluster membership. 

The survival outcomes are displayed in bar graphs in which Groupl represents the proportion of patients with DFS time < 2 yrs (early relapse) 
and Group2 represents the proportion of patients with DFS time > 8 yrs (late relapse). The tree is trimmed at the significance level of 0.5. 



outcomes (see Figure 5 (A)). Membership of these 
groups is much broader than that of the biclusters we 
have identified. Given the functional enrichment ana- 
lyses of the bicluster gene classifiers showing strong bio- 
logical significance in breast cancer, this suggests a 
potential avenue for identifying new mRNA-based pro- 
files for predicting disease recurrence. 

Biclusters may overlap one another, both in the tumours 
and the gene signatures which determine them. The clin- 
ical and molecular heterogeneity within breast cancer sug- 
gests that it is unlikely that one may accurately determine 
unique characteristics of tumours or predict patient clin- 
ical outcomes using a single or a small group of gene mar- 
kers. Different combinations of molecular pathways may 
be activated in, and therefore significantly associate with 
the recurrence of the disease in different patients. There- 
fore, biclusters which are determined from multiple and 
potentially overlapping gene expression profiles may be 
more relevant in the context of personalised medicine, 
which requires finer groupings of patients with specific 
molecular and clinical characteristics. 

In performing cross-validation analysis we assigned 'test' 
tumours to biclusters derived from the 'training' set of 
tumours by ranking correlation with the first principal 
component of the gene signature of each bicluster. We 
found relatively high correlations between all members of 
the test set and all biclusters. Nevertheless, samples were 
allocated to biclusters with consistency in both molecular 
and clinical characteristics. In practice, new patients arriv- 
ing in the clinic could be assigned to biclusters on the basis 
of both tumour gene expression and clinical covariates 



(we have shown that bicluster membership is associated 
with clinical variables, so this may add further useful infor- 
mation in associating new patients with tumour sub- 
groups). Furthermore, the power of the predictive model 
can be improved by including new cases into the current 
'training' population once clinical outcomes are known, 
hence re-determining biclusters periodically. 

Conclusions 

With the increasing clinical use of genomic profiling, 
which is producing very large amounts of data for large 
numbers of patients, methods to use this data to segre- 
gate patients into prognostic and treatment groups is in- 
creasingly important. Our study proposes a biclustering 
approach to identify stratified biomarkers, without pre- 
selection of gene markers or molecular pathways. Our 
results show the utility of biclustering in identifying of mo- 
lecularly and clinico-pathologically distinct tumour sub- 
types that are strongly associated with differential clinical 
outcomes. The prognostic significance of biclustered pa- 
tient groups was observed in well-, moderately- and poorly- 
differentiated disease. Bicluster-specific patient stratification 
characterised by different gene profiles fits well within the 
context of personalised medicine: different sets of molecu- 
lar pathways being active in different tumours. 

Methods 

Non-adjuvant treated breast cancer dataset 

This dataset contains the expression profiles of more 
than 20,000 probe sets across 960 patient samples, mea- 
sured using Affymetrix HG-U133A and HG-U133PLUS2 
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GeneChip arrays, quantile normalized across all patient 
samples, as described in Lasham et al [41]. This study 
described gene expression microarray data and matched 
clinical data for primary breast tumour patients assembled 
from multiple published studies (GSE3494, GSE1456, 
GSE6532, GSE7390, GSE4922, GSE36771, and GSE36772). 

In order that direct associations between molecular in- 
formation and long-term disease outcomes might be 
observed without the confounding affects of different 
treatments, we restricted our analysis to patients who had 
not received adjuvant or neo-adjuvant chemotherapy, 
endocrine-based therapy or radiotherapy, resulting in a 
dataset for analysis containing 437 tumours from the fol- 
lowing cohorts (Figure 1): GSE1456 (33 samples), 
GSE3494 (122), GSE6532_U133A (44), GSE7390 (198), 
NZ_U133A (30 samples; part of GSE36771) and Singapore 
(10 samples; 6 from GSE4922 and 4 from GSE36772). 

We removed all genes without significant variation 
across samples by estimating a noise threshold based on 
the mean signal of the BioB 3' probe set across all chips, 
and retaining all probes that were expressed above this 
noise level in at least 10% of chips. Since Affymetrix 
HG-U133A microarrays include multiple probe sets 
annotated with same official gene symbol (OGS), to 
avoid inclusion of outlying probe sets and to reduce 
computational cost, probe sets without OGS annotation 
in version 2.6.3 of the Bioconductor hgul33a.db annota- 
tion package were also removed. We retained 7756 
probe sets for analysis - one probe set each OGS (whose 
expression had the highest correlation to the median ex- 
pression of all probe sets with the same annotation). 



Finally, to remove non-biological variation associated 
with cohort, ComBat [42] processing was then applied 
to adjust for cohort-correlated batch effect across the 
non-adjuvant treated tumour data set. 

Biclustering using cMonkey to identify molecular-based 
tumour subgroups 

We used the cMonkey biclustering algorithm [32] to iden- 
tify subgroups of non-adjuvant treated tumours with co- 
herent patterns of gene expression over a subset of genes. 
cMonkey performs 2D clustering on both genes and con- 
ditions (microarrays) by using a Markov chain process that 
iteratively adds or removes genes and conditions to or 
from a cluster, initially with a single gene, until the optimal 
cluster is achieved [43]. Using this approach, genes and 
tumour samples can be allocated into more than one 
bicluster, consistent with the principle that genes may par- 
ticipate in more than one biological process or gene ex- 
pression module [44]. 

We note that we used cMonkey to identify clusters of 
tumours conditional on common expression profiles across 
a set of genes, rather than the standard usage (to identify 
co-expressed genes across a subset of conditions), and 
therefore the data matrix was presented in a transposed 
form in comparison to normal usage. Since the genome in- 
formation for Homo sapiens was not available for the new 
R package version of cMonkey (version 4.8.4) [45], the 
motif and network information was switched off in the in- 
put of the cMonkey command line. 

We chose to use cMonkey because of its computational 
efficiency in dealing with high-dimensional microarray 
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data, and its ability to detect co-expression patterns by 
considering the membership of tumours and genes simul- 
taneously without large overlap between biclusters. How- 
ever, a limitation is the relatively narrow distribution in 
the number of genes used to determine each bicluster - a 
known current limitation of cMonkey. While this biases 
the analysis towards biclusters defined by gene sets of 
similar size, the gene membership was shown to be differ- 
ent between biclusters. 

Differences in survival distribution - logrank test 

We used the Kaplan-Meier plot to estimate disease free 
survival (DFS) distributions. The logrank test was used 
to assess the differences in survival distribution between 
any two sets of survival curves (smaller p-value indicates 
stronger statistical evidence against the null hypothesis 
that two curves were drawn from the same survival dis- 
tribution). The Kaplan-Meier estimation and logrank 
test were performed using MatLab scripts kmplot and 
logrank [46]. 

PAM to determine molecular classifiers for each bicluster 

Prediction analysis of microarrays (PAM) [35] was used 
to determine the minimal set of genes classifying each 
pair of biclusters associated with low and high risk of re- 
currence, with the lowest misclassification rate. Initially, 
we ran the PAM algorithm with all 7756 probe sets as 
input in the low risk group to high risk group compari- 
son and obtained a minimal set of probe sets, which 
gave the lowest misclassification error. All probes in the 
minimal set are differentially expressed between the low 
and high risk-associated biclusters. This was executed 
using the pamr package in R. 

Cox proportional hazards regression 

The prognostic effect of molecular and clinical-pathological 
covariates on patient survival was assessed within the 
framework of Cox proportional hazard (PH) regression. A 
univariate Cox PH model was used to examine the individ- 
ual association between each covariate and survival time. A 
multivariate Cox PH model was used to test the prognostic 
effect of combinations of covariates. The significance of as- 
sociation was quantified by the Cox PH p-value: smaller 
p-values indicate the stronger contributions of molecular 
and clinical variables to survival outcomes. The computa- 
tion of the association between gene signatures and biclus- 
ters was implemented using the coxphfit function in 
MatLab and prognostic association was assessed using the 
survival package in R. 

Cross validation: Association of test data with biclusters 

In order to determine the association between a tumour 
from the 'test' data set and the biclusters calculated from 
the 'training' set, we used the first principal component 



(PCI) as a representative gene expression profile for 
each tumour sample in each bicluster. We computed the 
Pearson s correlation between these profiles and the gene 
expression profiles of tumours in the 'test' set for each 
bicluster. 'Test' tumours were associated with biclusters 
for which the correlation coefficient was larger than a 
pre-defined threshold value (> 0.9), such that each 'test' 
tumour sample could be allocated into more than one 
bicluster. 

Survival prediction model 

To detect whether classification of tumours using biclus- 
tering can be used to improve the prediction of survival 
with censoring, we used two different approaches to esti- 
mate a regression relationship between survival time and 
the status of multiple covariates including both clinical 
and molecular variables (biclusters): (1) a conditional in- 
ference tree [47] constructed by recursive binary partition- 
ing process and (2) ridge regression method based on Cox 
regression model [48]. For these models the covariates 
were entered as categorical variables, i.e. ER status (ER+, 
ER-); PgR status (PgR+, PgR-); LN status (LN+, LN-); 
Histological Tumour Grade (Gl, G2, G3); Tumour Size 
(size < 10 mm, 10 mm < size 20 mm, 20 mm < size 
<40 mm, and size >40 mm); Patient Age (age < 40 years, 
40 years < age < 50 years, 50 years < age < 60 years, and 
age > 60 years); Bicluster Membership (0,1). The condi- 
tional inference tree was implemented through ctree func- 
tion in the R package party, MatLab code for the Cox 
regression based ridge regression method is provided by 
Bovelstad et al. [36]. 

The computational methods used in this study will be 
made available on request to the authors. 

Additional files 



Additional file 1: Figure SI. Clinical information for each individual 
bicluster determined by cMonkey from gene expression data of breast 
cancer patients who had not received adjuvant treatment (437 patients). 
The clinical variables include cohort, lymph node status (LN.status: LN+, 
LN- and NA), estrogen receptor status (ER.status: ER+, ER- and NA), 
progesterone receptor status (PgR.status: PgR+, PgR- and NA), tumour 
grade (Grade: G1, G2, G3 and NA), molecular subtype (Subtype: Basal, 
Her2, Luminal A, Luminal B, Normal and None), tumour size (Size in mm) 
and patient Age (in years). Missing information is noted as NA (Not 
Available). The percentage noted on each bar indicates the percentage 
of tumour samples of the specific category in each bicluster. Figure S2. 
Kaplan-Meier estimates of survival distribution of randomly selected 
patient groups. Kaplan-Meier estimate of disease free survival (DFS) for 44 
groups with arbitrary patient membership (all patients were randomly 
allocated into 44 groups), superimposed on survival curve of (A) 
histological grades and (B) tumour subtypes, respectively. Distribution of 
bicluster-associated survival curves in Figure 4 is much broader than the 
distribution of patient group-associated survival curves by chance. Figure 
S3. Kaplan-Meier plots of pairs of biclusters showing differences in 
survival. Kaplan-Meier plots estimate disease free survival (DFS) 
distribution of pairwise biclusters comparison with significant difference 
in their survival distribution (logrank p value < 0.01). In each subplot, blue 
curve presents patients' group associated with good prognosis while 
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curves in green represents patients' group associated with poor 
prognosis. Logrank p-values denotes on the top of each subplot 
indicates the significance of difference in survival distribution between 
two biclusters. Figure S4. Volcano plots showing genes that are 
differentially expressed between biclusters which have significant 
difference (logrank p values < 0.01) in their survival curves, associated 
with good and poor prognosis respectively. The statistical significance of 
differential gene expression was quantified by two sample r-test p-value. 
The fold change in expression level of each gene between two biclusters 
was evaluated. We identified genes whose expression level was 
significantly different by at least 2-fold change between biclusters 
associated with good and poor survival. We plot the statistical 
significance of differentially expressed genes against fold change in the 
expression levels between low and high risk biclusters to identify both 
statistically and biologically significant genes. By setting the cut off of 
significance level at 0.05 (horizontal red dashed line) and fold change at 
2 (two vertical red dashed lines), the volcano plots identified genes with 
elevated expression levels, which represents the top-most significantly 
differentially expressed genes between biclusters associated with good 
and poor survival. The volcano plot was generated by mavolcanoplot in 
MatLab. Figure S5. Functional enrichment heatmap for PAM gene 
classifiers of biclusters which have significant difference (logrank p values 
< 0.01) in their survival curves, associated with good and poor prognosis. 

Additional file 2: Summarizes the significantly enriched 
(GeneSetDB p-value < 0.01) functional annotations of genes that 
were most strongly associated with clinical outcome for each 
bicluster. 

Additional file 3: Table SI. Functional annotation enrichment in 
biclusters associated with low risk of relapse and in biclusters associated 
with high risk of relapse. Table S2. Differentially expressed genes 
between Bicluster 7 and Bicluster 21. Genes are ranked in ascending 
order of p-values reported from two-sample r-test, and the related 
statistics scores of each gene, including t-scores, p-values, False Discovery 
Rate (FDR), q-values and Benjamini-Hochberg (BH) adjusted FDR are 
summarized. Table S3. Enrichment analysis of differentially expressed 
genes between Bicluster 7 and Bicluster 21 determined by GeneSetDB at 
the significance level < 0.001. Table S4. Gene Classifiers determined by 
PAM - the maximum number of probe sets which can accurately 
characterise Bicluster 7 and Bicluster 21. The highlighted genes overlap 
with the OncotypeDx commercial gene list. Table S5. Multivariate 
analyses of prognostic importance of biclusters in comparison to 
conventional clinical factors, molecular tumour subtypes, and genetic 
grade. Significance of biclustered classification for prognosis was 
compared to chance by randomly assigned tumours into 44 arbitrary 
groups and estimating the association between membership of these 
arbitrary groups and DFS. Univariate Cox PH analysis revealed no 
statistically significant association between arbitrary group membership 
and survival outcome. Significance is indicated in the ranges: 0 < 0.001 
'***'; 0.001 < 0.01 '**'; 0.01 < 0.05 '*'; 0.05 < 0.1 0.1 < 1 Table S6. 
Summary of clinical information of 437 non-adjuvant treated patients. 

Additional file 4: Summarises probe sets which were able to 
accurately characterise survival outcome differences between the 
biclusters. 
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